Abstract. Unsupervised feature selection has attracted research attention in
the machine learning and data mining communities for decades. In this paper,
we propose an unsupervised feature selection method that seeks a feature
coefficient matrix to select the most distinctive features. Specifically, our
algorithm integrates the Maximum Margin Criterion with a sparsity-based model
in a joint framework, where the class margin and feature correlation are taken
into account at the same time. To maximize the total data separability while
keeping the within-class scatter small, we embed K-means into the framework to
generate pseudo class labels, since no labels are available in unsupervised
feature selection. Meanwhile, a sparsity-based model, the $\ell_{2,p}$-norm, is
imposed on the regularization term to discover the sparse structure of the
feature coefficient matrix. In this way, noisy and irrelevant features are
removed by ruling out those features whose corresponding coefficients are
zero. To alleviate the local optimum problem caused by random initializations
of K-means, we propose a convergence-guaranteed algorithm with an updating
strategy for the clustering indicator matrix that iteratively pursues the
optimal solution. Performance is evaluated extensively on six benchmark data
sets. The experimental results show that our method outperforms all the
compared approaches.
1 Introduction


Over the past few years, data have often been represented by high-dimensional
features in a number of research fields, such as data mining [29], computer
vision [?], etc. With so many sophisticated data representations available, one
problem has never lacked research attention: how to select the most distinctive
features from high-dimensional data for subsequent learning tasks, e.g.
classification? To answer this question, we take two points into account. First,
the number of selected features should be smaller than the total number of
features. With a lower-dimensional representation, subsequent learning tasks
become more efficient [33]. Second, the selected features should have more
discriminative power than the full original feature set. Many previous works
have shown that removing noisy and irrelevant features improves discriminative
power in most cases. In light of these advantages, many new feature selection
algorithms have recently emerged for various types of applications.

According to the type of supervision, feature selection algorithms can be
divided into three categories: supervised, semi-supervised, and unsupervised.
Representative supervised feature selection algorithms include Fisher score
[7], Relief [13] and its extension ReliefF, information gain [?], etc. [?].
Label information of the training data is used to guide supervised methods
toward distinctive feature subsets with different search strategies, i.e.
complete search, heuristic search, and non-deterministic search. In the real
world, class information is quite limited, which has led to the development of
semi-supervised feature selection methods [?], in which both labeled and
unlabeled data are utilized.

In unsupervised scenarios, feature selection is more challenging, since no
class information is available to guide the selection. In the literature,
unsupervised feature selection methods can be roughly categorized into three
groups: filter, wrapper, and embedded methods. Filter-based methods rank
features according to some intrinsic properties of the data, and the features
with the highest scores are selected for subsequent learning tasks. The
selection is independent of the subsequent learning process. For example, He et
al. [10] assume that data from the same class are often close to each other and
use the locality preserving power of the data, termed the Laplacian Score, to
evaluate the importance of features. In [33], a unified framework has been
proposed for both supervised and unsupervised feature selection schemes using a
spectral graph. Tabakhi et al. [33] have proposed an unsupervised feature
selection method that selects the optimal feature subset with an iterative
algorithm based on ant colony optimization. Wrapper-based methods, a more
sophisticated approach, wrap learning algorithms and use the learned results to
select distinctive feature subsets. In [?], for instance, the authors developed
a model that selects relevant features using two backward stepwise selection
algorithms without prior knowledge of the features. Wrapper-based methods
normally perform better than filter-based methods because they use learning
algorithms; their disadvantage is that they are computationally more expensive.
Embedded methods seek a trade-off between the two by integrating feature
selection and clustering into a joint framework. Because clustering algorithms
can provide pseudo labels that reflect the intrinsic information of the data,
some works [?] incorporate different clustering algorithms into their objective
functions to select features.

Most existing unsupervised feature selection methods [?] rely on a graph, e.g.
the graph Laplacian, to reflect intrinsic relationships among data, labeled and
unlabeled. When the number of data points is extremely large, constructing a
graph Laplacian is computationally expensive. Meanwhile, some traditional
feature selection algorithms [?] neglect correlations among features:
distinctive features are selected individually according to the importance of
each feature, rather than by taking correlations among features into account.
Recently, exploiting feature correlations has attracted much research attention
[?], and it has been shown that discovering feature correlations is beneficial
to feature selection.

In this paper, we propose a graph-free method that selects features by
combining the Maximum Margin Criterion with feature correlation mining in a
joint framework. On the one hand, the method aims to learn a feature
coefficient matrix that linearly combines features to maximize the class
margins. While increasing the separability of the transformed data by
maximizing the total scatter, the method also requires distances between data
points within the same class to be minimized after the linear transformation by
the coefficient matrix. Since no class information is available, K-means
clustering is jointly embedded in the framework to provide pseudo labels. On
the other hand, inspired by recent feature selection works that apply a
sparsity-based model to the regularization term [?], the proposed algorithm
learns the sparse structure of the coefficient matrix, with the goal of
removing noisy and irrelevant features, namely those whose coefficients are
zero. The main contributions of this paper can be summarized as follows:


— The proposed method maximizes class margins in a framework that
simultaneously considers the separability of the transformed data and the
distances between transformed data within the same class. In addition, a
sparsity-based regularization model is jointly applied to the feature
coefficient matrix to analyze correlations among features in an iterative
algorithm;

— K-means clustering is embedded into the framework to generate cluster
labels, which are used as pseudo labels. Both maximizing class margins and
learning sparse structures benefit from the generated pseudo labels during
each iteration;

— Because the performance of K-means is dominated by its initialization, we
propose a strategy that prevents our algorithm from rapidly converging to a
local optimum, an issue largely ignored by most existing approaches that use
K-means clustering. A theoretical proof of convergence is also provided;

— We have conducted extensive experiments over six benchmark datasets. The
experimental results show that our method performs better than all the
compared unsupervised algorithms.

The rest of this paper is organized as follows. Notations and definitions used
throughout the paper are given in Section 2. Our method is elaborated in
Section 3, followed by its optimization, with an algorithm that guarantees
convergence, in Section 4. In Section 5, extensive experimental results are
reported with related analysis. Lastly, the conclusion is given in Section 6.


2 Notations and Definitions 


To aid understanding of the proposed method, this section summarizes the
notations and definitions used throughout the paper. Matrices and vectors are
written as boldface uppercase letters and boldface lowercase letters,
respectively. The data set is denoted as $X = [x_1, \ldots, x_n] \in
\mathbb{R}^{d \times n}$, where $n$ is the number of training data and $d$ is
the feature dimension. The mean of the data is denoted as $\bar{x}$. The feature
coefficient matrix, $W \in \mathbb{R}^{d \times d'}$, linearly combines data
features as $W^T X$, where $d'$ is the feature dimension after the linear
transformation. Given a cluster centroid matrix for the transformed data,
$G = [g_1, \ldots, g_c] \in \mathbb{R}^{d' \times c}$, where $c$ is the number
of centroids, the cluster indicator of the transformed $x_i$ is represented as
$u_i = [u_{i1}, \ldots, u_{ic}]$. If the transformed $x_i$ belongs to the $j$-th
cluster, $u_{ij} = 1$, $j = 1, \ldots, c$; otherwise, $u_{ij} = 0$.
Correspondingly, the cluster indicator matrix is
$U = [u_1^T, \ldots, u_n^T]^T \in \mathbb{R}^{n \times c}$.

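For an arbitrary matrix $M \in \mathbb{R}^{r \times s}$, its $\ell_{2,p}$-norm
is defined as:

$$\|M\|_{2,p} = \left[ \sum_{i=1}^{r} \left( \sum_{j=1}^{s} M_{ij}^2 \right)^{p/2} \right]^{1/p} \tag{1}$$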
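As a quick numerical illustration of this definition, the following is a minimal sketch assuming NumPy; the function name `l2p_norm` and the example matrix are illustrative and not part of the paper. It first takes the $\ell_2$-norm of every row and then the $\ell_p$-norm of those row norms, so rows that are entirely zero contribute nothing.

```python
import numpy as np

def l2p_norm(M, p):
    """l_{2,p}-norm: (sum_i ||m^i||_2^p)^(1/p), where m^i is the i-th row of M."""
    row_norms = np.sqrt((M ** 2).sum(axis=1))   # l2-norm of every row
    return (row_norms ** p).sum() ** (1.0 / p)

# Illustrative matrix: row norms are 5, 0 and 10.
M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [6.0, 8.0]])

print(l2p_norm(M, 1.0))   # sum of the row norms
print(l2p_norm(M, 2.0))   # coincides with the Frobenius norm
```

With p = 2 the value reduces to the Frobenius norm, which is a convenient sanity check for any implementation.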


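The $i$-th row of $M$ is represented by $m^i$. The between-class, within-class
and total scatter matrices of data are respectively defined as:

$$S_b = \sum_{i=1}^{c} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T,$$

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j - \bar{x}_i)(x_j - \bar{x}_i)^T, \tag{2}$$

$$S_t = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T,$$

where $n_i$ is the number of data in the $i$-th class, $\bar{x}_i$ is the mean
of the $i$-th class, and the inner sum in $S_w$ runs over the data of that
class. These matrices satisfy $S_t = S_w + S_b$. Other notations and definitions
will be explained when they are in use.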
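The decomposition of the total scatter into within-class and between-class parts can be checked numerically. The sketch below assumes NumPy and a column-per-sample data layout; the function name `scatter_matrices` and the random data are illustrative only.

```python
import numpy as np

def scatter_matrices(X, labels):
    """X is d x n (one column per sample); labels holds n class indices."""
    d, n = X.shape
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    S_t = Xc @ Xc.T                                   # total scatter
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[:, labels == k]                        # samples of class k
        mean_k = Xk.mean(axis=1, keepdims=True)
        S_b += Xk.shape[1] * (mean_k - mean) @ (mean_k - mean).T
        S_w += (Xk - mean_k) @ (Xk - mean_k).T
    return S_b, S_w, S_t

# Illustrative random data: 4 features, 30 samples, 3 classes of 10 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))
labels = np.repeat(np.arange(3), 10)

S_b, S_w, S_t = scatter_matrices(X, labels)
print(np.allclose(S_t, S_w + S_b))
```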



3 Proposed Method 


We now introduce our proposed method for unsupervised feature selection. To
exploit distinctive features, an intuitive way is to find a linear
transformation matrix that projects the data into a new space where the
original data are more separable. PCA is the most popular approach to analyze
the separability of features. PCA seeks the directions along which the
transformed data have maximum variance. In other words, PCA maximizes the
separability of linearly transformed data by maximizing the covariance:
$\max_W \sum_{i=1}^{n} (W^T x_i - W^T \bar{x})^T (W^T x_i - W^T \bar{x})$.
Without loss of generality, we assume the data have zero mean, i.e.
$\bar{x} = 0$. Recalling the definition of the total scatter of data, PCA is
equivalent to maximizing the total scatter. However, if only the total scatter
is considered as a separability measure, the within-class scatter might also
be maximized along with it, which is not helpful for discovering distinctive
features. The representative model, LDA, solves this problem by maximizing the
Fisher criterion: $\max_W \frac{Tr(W^T S_b W)}{Tr(W^T S_w W)}$. However, LDA and
its variants require class information to construct the between-class and
within-class scatter matrices [2], which is not suitable for unsupervised
feature selection. Before we give the objective that solves the aforementioned
problem, we first look at a supervised feature selection framework:

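$$\max_{W} \sum_{i=1}^{n} (W^T x_i)^T (W^T x_i) - \alpha \sum_{i=1}^{c} \sum_{j=1}^{n_i} (W^T x_j - W^T \bar{x}_i)^T (W^T x_j - W^T \bar{x}_i) - \beta \Omega(W) \quad s.t.\ W^T W = I, \tag{3}$$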

where $\alpha$ and $\beta$ are regularization parameters. In this framework, the
first term maximizes the total scatter, while the second term minimizes the
within-class scatter. The third part is a sparsity-based regularization term
that controls the sparsity of $W$. This model is quite similar to the classical
LDA-based methods. Because there is no class information in the unsupervised
scenario, we need virtual labels to minimize the distances between data within
the same class while maximizing the total separability at the same time. To
achieve this goal, we apply K-means clustering in our framework to replace the
ground truth by generating cluster indicators of the data. Given $c$ centroids
$G = [g_1, \ldots, g_c] \in \mathbb{R}^{d' \times c}$, the traditional K-means
algorithm aims to minimize the following function:


$$\sum_{i=1}^{c} \sum_{j:\, u_{ji} = 1} (y_j - g_i)^T (y_j - g_i) = \sum_{i=1}^{n} (y_i - G u_i^T)^T (y_i - G u_i^T), \tag{4}$$


where $y_i = W^T x_i$. Note that K-means is used to assign cluster labels, which
serve as pseudo labels, in order to minimize the within-class scatter after the
linear transformation by $W$. We can then substitute (4) into (3):

$$\max_{W,G,U} \sum_{i=1}^{n} (W^T x_i)^T (W^T x_i) - \alpha \sum_{i=1}^{n} (W^T x_i - G u_i^T)^T (W^T x_i - G u_i^T) - \beta \Omega(W) \quad s.t.\ W^T W = I. \tag{5}$$

As mentioned above, sparsity-based regularization has been widely used to find
correlated structures among features. The motivation is to exploit the sparse
structure of the feature coefficient matrix. By imposing the sparse constraint,
some rows of the feature coefficient matrix shrink to zero, and the features
corresponding to non-zero coefficients are selected as the distinctive subset
of features. In this way, noisy and redundant features can be removed. This
sparsity-based regularization has been applied to various problems. Inspired by
the "shrinking to zero" idea, we use a sparsity model to uncover the common
structures shared by features. To achieve that goal, we propose to minimize the
$\ell_{2,p}$-norm of the coefficient matrix, $\|W\|_{2,p}$ $(0 < p < 2)$. From
the definition of $\|W\|_{2,p}$ in (1), outliers and the negative impact of the
irrelevant rows $w^i$ are suppressed by minimizing the $\ell_{2,p}$-norm. Note
that $p$ is a parameter that controls the degree of correlated structures among
features: the lower $p$ is, the more shared structure among features is
expected to be exploited. After a number of optimization steps, the optimal
feature coefficient matrix $W$ can be obtained. Thus, we impose the
$\ell_{2,p}$-norm on the regularization term and rewrite the objective function
in matrix form as follows:


$$\max_{W,G,U} Tr(W^T S_t W) - \alpha \|W^T X - G U^T\|_F^2 - \beta \|W\|_{2,p} \quad s.t.\ W^T W = I, \tag{6}$$


where $U$ is the indicator matrix, $Tr(\cdot)$ is the trace operator, and
$\|\cdot\|_F$ is the Frobenius norm of a matrix. Our proposed method integrates
the Maximum Margin Criterion and sparse regularization into a joint framework.
Embedding K-means into the framework not only minimizes the distances between
within-class data while maximizing the total data separability, but also
provides cluster labels. The cluster centroids generated by K-means can further
guide the sparse structure learning on the feature coefficient matrix in each
iterative step of our solution, which will be explained in the next section. We
name this method, unsupervised feature analysis with class margin optimization,
UFCM.
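To make the three terms of the objective concrete, the sketch below evaluates the total scatter, the K-means residual and the row-sparsity penalty for given W, G and U. It assumes NumPy and column-per-sample data; the function name `ufcm_objective` and the random inputs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ufcm_objective(W, X, G, U, alpha, beta, p):
    """Tr(W^T S_t W) - alpha * ||W^T X - G U^T||_F^2 - beta * ||W||_{2,p}.

    Shapes: X is d x n, W is d x d', G is d' x c, U is n x c (one-hot rows).
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    S_t = Xc @ Xc.T
    total_scatter = np.trace(W.T @ S_t @ W)
    kmeans_residual = np.linalg.norm(W.T @ X - G @ U.T, "fro") ** 2
    row_norms = np.linalg.norm(W, axis=1)
    sparsity = (row_norms ** p).sum() ** (1.0 / p)
    return total_scatter - alpha * kmeans_residual - beta * sparsity

# Illustrative inputs (all hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
W, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # orthonormal columns, W^T W = I
labels = rng.integers(0, 3, size=20)
U = np.eye(3)[labels]                           # one-hot indicator rows
G = rng.normal(size=(2, 3))

value = ufcm_objective(W, X, G, U, alpha=1.0, beta=1.0, p=1.0)
print(value)
```

Setting alpha and beta to zero leaves only the total scatter term, which is the PCA criterion discussed at the start of this section.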


4 Optimization 

In this section, we present our solution to the objective function in (6).
Since the $\ell_{2,p}$-norm is used to exploit sparse structures, the objective
function cannot be solved in closed form. Moreover, the objective function is
not jointly convex with respect to the three variables $W$, $G$ and $U$. Thus,
we propose to solve the problem as follows.

We define a diagonal matrix $D$ whose diagonal entries are:

$$D_{ii} = \frac{1}{\frac{2}{p} \|w^i\|_2^{2-p}}, \tag{7}$$

where $w^i$ is the $i$-th row of $W$. The objective function in (6) is
equivalent to:

$$\max_{W,G,U} Tr(W^T S_t W) - \alpha \|W^T X - G U^T\|_F^2 - \beta\, Tr(W^T D W) \quad s.t.\ W^T W = I. \tag{8}$$
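The role of the diagonal matrix D can be seen from a short numerical check. Assuming the usual reweighting $D_{ii} = p / (2\|w^i\|_2^{2-p})$, the trace term satisfies $Tr(W^T D W) = \frac{p}{2} \sum_i \|w^i\|_2^p$, so it tracks the row-sparsity penalty while being quadratic in W for fixed D. The sketch below assumes NumPy; the function name `reweighting_matrix` and the small constant `eps` are assumptions added for illustration.

```python
import numpy as np

def reweighting_matrix(W, p, eps=1e-12):
    """Diagonal D with D_ii = p / (2 * ||w^i||_2^(2 - p)).

    eps is an assumption added here to avoid division by zero when a row
    of W is exactly zero; it is not taken from the paper.
    """
    row_norms = np.linalg.norm(W, axis=1)
    diag = p / (2.0 * np.maximum(row_norms, eps) ** (2.0 - p))
    return np.diag(diag)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))
p = 1.0

D = reweighting_matrix(W, p)
lhs = np.trace(W.T @ D @ W)
rhs = 0.5 * p * (np.linalg.norm(W, axis=1) ** p).sum()
print(np.isclose(lhs, rhs))
```

Because D depends on W, it has to be recomputed from the current W in every iteration.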


We optimize the objective function in two steps in each iteration, as follows:

(1) Fix W, G and optimize U:

When $W$ is fixed, the first and third terms can be viewed as constants, while
the second term can be viewed as the objective function of K-means, which
assigns a cluster label to each data point. Since the cluster centroid matrix
$G = [g_1, \ldots, g_c]$ is also fixed, the optimal $U$ is:

$$U_{ij} = \begin{cases} 1, & j = \arg\min_{k} \|W^T x_i - g_k\|_F^2, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

This is equivalent to performing the K-means assignment on the transformed
data, which means the solution is unique.
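The assignment step can be sketched in a few lines. The code below assumes NumPy and the shapes used in the notation section; the function name `update_indicator` and the toy data are illustrative only.

```python
import numpy as np

def update_indicator(W, X, G):
    """Assign each transformed sample W^T x_i to its nearest centroid g_k."""
    Y = W.T @ X                                    # d' x n transformed data
    # Squared Euclidean distances between samples and centroids, shape n x c.
    dist = ((Y[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)
    nearest = dist.argmin(axis=1)
    n, c = X.shape[1], G.shape[1]
    U = np.zeros((n, c))
    U[np.arange(n), nearest] = 1.0                 # one-hot rows
    return U

# Toy example: identity transformation, two obvious clusters.
W = np.eye(2)
X = np.array([[0.0, 0.1, 5.0, 5.2],
              [0.0, 0.0, 5.0, 5.0]])
G = np.array([[0.0, 5.0],
              [0.0, 5.0]])

U = update_indicator(W, X, G)
print(U)
```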

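(2) Fix U and optimize W, G:

After fixing the indicator matrix $U$, we set the derivative of Equation (8)
with respect to $G$ equal to 0:

$$\frac{\partial\, \alpha\, Tr\left[ (W^T X - G U^T)^T (W^T X - G U^T) \right]}{\partial G} = 0$$

$$\Rightarrow\ -2\alpha W^T X U + 2\alpha G U^T U = 0$$

$$\Rightarrow\ G = W^T X U (U^T U)^{-1} \tag{10}$$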
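The closed form for G has a simple interpretation: since $U^T U$ is diagonal with the cluster sizes on its diagonal, each column of G is the mean of the transformed samples assigned to that cluster. A minimal check, assuming NumPy (the function name `update_centroids` and the random data are illustrative):

```python
import numpy as np

def update_centroids(W, X, U):
    """G = W^T X U (U^T U)^{-1}; requires every cluster to be non-empty."""
    return W.T @ X @ U @ np.linalg.inv(U.T @ U)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))
W = rng.normal(size=(3, 2))
labels = np.array([0, 0, 1, 1, 2, 2])
U = np.eye(3)[labels]                      # one-hot indicator rows

G = update_centroids(W, X, U)
Y = W.T @ X
cluster_means = np.stack([Y[:, labels == k].mean(axis=1) for k in range(3)], axis=1)
print(np.allclose(G, cluster_means))
```

An empty cluster makes $U^T U$ singular, so a practical implementation would need to handle that case separately.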


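Substituting Equation (10) into Equation (8), we have:

$$\begin{aligned}
& Tr(W^T S_t W) - \alpha \|W^T X - W^T X U (U^T U)^{-1} U^T\|_F^2 - \beta\, Tr(W^T D W) \\
&= \alpha\, Tr\left[ \left( W^T X U (U^T U)^{-1} U^T - W^T X \right) \left( W^T X - W^T X U (U^T U)^{-1} U^T \right)^T \right] + Tr(W^T S_t W) - \beta\, Tr(W^T D W) \\
&= \alpha\, Tr\left( W^T X U (U^T U)^{-1} U^T X^T W - W^T X X^T W \right) + Tr(W^T S_t W) - \beta\, Tr(W^T D W) \\
&= Tr\left[ W^T \left( S_t + \alpha X U (U^T U)^{-1} U^T X^T - \alpha X X^T - \beta D \right) W \right]
\end{aligned} \tag{11}$$

Thus, the objective function becomes:

$$\max_{W} Tr\left[ W^T \left( S_t + \alpha X U (U^T U)^{-1} U^T X^T - \alpha X X^T - \beta D \right) W \right] \quad s.t.\ W^T W = I \tag{12}$$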
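Maximizing the trace of $W^T A W$ over matrices with orthonormal columns, for a symmetric matrix A, is solved by the eigenvectors of A with the largest eigenvalues. The sketch below builds the matrix inside the trace and extracts those eigenvectors with `numpy.linalg.eigh`. The function names `build_matrix` and `update_W`, the explicit symmetrization, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def build_matrix(X, U, D, alpha, beta):
    """S_t + alpha * X P X^T - alpha * X X^T - beta * D, with P = U (U^T U)^{-1} U^T."""
    Xc = X - X.mean(axis=1, keepdims=True)
    S_t = Xc @ Xc.T
    P = U @ np.linalg.inv(U.T @ U) @ U.T       # projection onto the indicator columns
    A = S_t + alpha * (X @ P @ X.T) - alpha * (X @ X.T) - beta * D
    return (A + A.T) / 2.0                     # symmetrize against round-off

def update_W(X, U, D, alpha, beta, d_out):
    """Return the d_out eigenvectors with the largest eigenvalues."""
    A = build_matrix(X, U, D, alpha, beta)
    eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in ascending order
    return eigvecs[:, -d_out:]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
X = X - X.mean(axis=1, keepdims=True)          # zero-mean data, as assumed in Section 3
U = np.eye(3)[np.arange(40) % 3]               # illustrative cluster assignment
D = np.eye(5)                                  # illustrative reweighting matrix

W = update_W(X, U, D, alpha=1.0, beta=0.1, d_out=2)
print(np.allclose(W.T @ W, np.eye(2)))
```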




Algorithm 1 Unsupervised Feature Analysis with Class Margin Optimization.

Input: Data matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ and parameters $\alpha$ and $\beta$.

Output: Feature coefficient matrix $W$ and cluster indicator matrix $U$.

1: Initialize W by PCA on X;

2: Initialize U by K-means on $W^T X$;

3: repeat

4:   Compute D according to (7);

5:   Update U according to (14);

6:   Update W by eigen-decomposition of (13);

7:   Update G according to (10);

8: until Convergence


The objective function can then be solved by performing eigen-decomposition of
the following matrix:

$$S_t + \alpha X U (U^T U)^{-1} U^T X^T - \alpha X X^T - \beta D \tag{13}$$

The optimal $W$ is formed by the $d'$ eigenvectors corresponding to the $d'$
largest eigenvalues, $d' < d$. Our proposed method can be solved by the above
steps in an iterative algorithm, and each step obtains the corresponding
optimum. As the cluster indicator matrix $U$ is initialized by K-means, the
performance of our algorithm depends on the initialization of K-means. To
alleviate the local optimum problem, an update strategy for $U$ is needed.
Generally speaking, we randomly initialize $U$ a number of times, compare the
candidates according to the second term in Equation (8), and then choose how to
update the indicator matrix. Specifically, suppose the optimal $U_i^*$ and
$W_i^*$ have been derived in the $i$-th iteration. In the $(i+1)$-th iteration,
we first randomly initialize $U$ $r$ times ($r = 10$ in our experiment) and
combine the results with the $U_i^*$ derived in the $i$-th iteration into an
updating candidate set $\mathcal{U}_{i+1} = \{U_{i+1}^0, U_{i+1}^1, \ldots,
U_{i+1}^r\}$, where $U_{i+1}^0 = U_i^*$. According to
$\|W^T X - G U^T\|_F^2$, the candidate that yields the smallest value is chosen
as the update:

$$U_{i+1}^* = U_{i+1}^j, \quad j = \arg\min_{j} \|W^T X - G (U_{i+1}^j)^T\|_F^2, \tag{14}$$

where $j$ is the index into the candidate set, $j = 0, 1, \ldots, r$. In this
way, we compare the derived cluster indicator matrix with $r$ randomly
initialized counterparts to alleviate the local optimum problem. We summarize
the solution in Algorithm 1, which outputs the learned feature coefficient
matrix $W$ used to select distinctive features.
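The candidate comparison can be sketched as follows. This is only one possible reading: the text does not specify how the random candidates are generated, so the sketch assumes uniformly random cluster assignments. The function name `choose_indicator`, the `seed` argument and the toy data are illustrative assumptions, and NumPy is assumed.

```python
import numpy as np

def choose_indicator(W, X, G, U_prev, r=10, seed=0):
    """Pick, among U_prev and r random candidates, the indicator matrix with
    the smallest residual ||W^T X - G U^T||_F^2.

    Assumption: random candidates are uniformly random assignments.
    """
    rng = np.random.default_rng(seed)
    n, c = U_prev.shape
    candidates = [U_prev]                               # index 0: derived indicator
    for _ in range(r):
        labels = rng.integers(0, c, size=n)
        candidates.append(np.eye(c)[labels])
    Y = W.T @ X
    costs = [np.linalg.norm(Y - G @ U.T, "fro") ** 2 for U in candidates]
    return candidates[int(np.argmin(costs))]

def residual(W, X, G, U):
    return np.linalg.norm(W.T @ X - G @ U.T, "fro") ** 2

# Toy data: two well-separated clusters, identity transformation.
W = np.eye(2)
X = np.array([[0.0, 0.1, 0.2, 0.1, 5.0, 5.1, 5.2, 5.1],
              [0.0, 0.1, 0.0, 0.2, 5.0, 5.1, 5.0, 5.2]])
G = np.array([[0.1, 5.1],
              [0.1, 5.1]])
U_prev = np.eye(2)[np.array([0, 0, 0, 0, 1, 1, 1, 1])]

U_new = choose_indicator(W, X, G, U_prev)
print(residual(W, X, G, U_new) <= residual(W, X, G, U_prev))
```

Because the previously derived indicator is always in the candidate set, the selected candidate can never have a larger residual than it.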

From Algorithm 1, it can be seen that the most expensive operation is the
eigen-decomposition in Equation (13), whose computational complexity is
$O(d^3)$. If the dimensionality of the data, $d$, is very high, dimensionality
reduction is desirable. To analyze the convergence of our proposed method, the
following proposition and its proof are given.

Proposition 1. Algorithm 1 monotonically increases the objective function in
Equation (6) until convergence.






Proof. Assume that, in the $i$-th iteration, the transformation matrix $W$ and
the cluster centroid matrix $G$ have been derived as $W_i$ and $G_i$. In the
$(i+1)$-th iteration, we use $W_i$ and $G_i$ to update $U_{i+1}$ according to
the updating strategy in (14). Because the candidate set contains $U_i$, the
chosen $U_{i+1}$ cannot yield a larger second term, and we have the following
inequality:

$$Tr(W_i^T S_t W_i) - \alpha \|W_i^T X - G_i U_i^T\|_F^2 - \beta \|W_i\|_{2,p} \le Tr(W_i^T S_t W_i) - \alpha \|W_i^T X - G_i U_{i+1}^T\|_F^2 - \beta \|W_i\|_{2,p} \tag{15}$$

Similarly, when $U_{i+1}$ is fixed to optimize $W$ and $G$ in the $(i+1)$-th
iteration, the following inequality can be obtained according to Equation (12):

$$Tr(W_i^T S_t W_i) - \alpha \|W_i^T X - G_i U_{i+1}^T\|_F^2 - \beta \|W_i\|_{2,p} \le Tr(W_{i+1}^T S_t W_{i+1}) - \alpha \|W_{i+1}^T X - G_{i+1} U_{i+1}^T\|_F^2 - \beta \|W_{i+1}\|_{2,p} \tag{16}$$

Combining (15) and (16) indicates that the proposed algorithm monotonically
increases the objective function in each iteration. It is worth noting that the
algorithm alleviates the local optimum problem raised by random initializations
of K-means rather than completely solving it. However, it avoids converging
rapidly to a local optimum and may converge to the global optimal solution.


5 Experiments 

In this section, experimental results are presented together with related
analysis. We compare our method with seven approaches over six benchmark
datasets. We also conduct experiments to evaluate performance variations in
different aspects, including the impact of the number of selected features, the
validation of feature correlation analysis, and parameter sensitivity analysis.
Lastly, convergence is demonstrated.

5.1 Experiment Setup 

In the experiments, we have compared our method with seven approaches as
follows:

— All Features: All original variables are preserved as the baseline in the
experiments.

— Max Variance: Features are ranked by their variance in descending order, and
the highest ranked features are selected.

— Spectral Feature Selection (SPEC) [34]: This method employs a unified
framework to select features one by one based on spectral graph theory.

— Multi-Cluster Feature Selection (MCFS) [1]: This unsupervised approach
selects the features that best preserve the multi-cluster structure of the
data. Features are selected using spectral regression with
$\ell_1$-norm regularization.

— Robust Unsupervised Feature Selection (RUFS) [?]: RUFS jointly performs
robust label learning and robust feature learning. To achieve this, robust
orthogonal nonnegative matrix factorization is applied to learn labels while
$\ell_{2,1}$-norm minimization is simultaneously used to learn the features.

— Nonnegative Discriminative Feature Selection (NDFS) [?]: NDFS exploits local
discriminative information and feature correlations simultaneously. The
manifold structure information is also considered jointly.

— Laplacian Score (LapScore) [10]: This method selects distinctive features by
evaluating their locality preserving power, which is also called the Laplacian
Score.
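As an illustration of the simplest baseline in the list above, Max Variance, the following sketch ranks features by variance and keeps the top k. It assumes NumPy and a feature-per-row data layout; the function name `max_variance_selection` and the tiny example are illustrative only.

```python
import numpy as np

def max_variance_selection(X, k):
    """X is d x n (one row per feature); return indices of the k features
    with the largest variance, in descending order of variance."""
    variances = X.var(axis=1)
    return np.argsort(variances)[::-1][:k]

X = np.array([[1.0, 1.0, 1.0, 1.0],      # variance 0
              [0.0, 2.0, 0.0, 2.0],      # variance 1
              [0.0, 10.0, 0.0, 10.0]])   # variance 25

selected = max_variance_selection(X, 2)
print(selected)                           # feature 2 first, then feature 1
```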

All the parameters (if any) are tuned in the range of
$\{10^{-?}, 10^{-?}, 10^{?}, 10^{?}\}$ for each algorithm mentioned above and
the best results are reported. The size of the neighborhood is set to 5 for any
algorithm based on spectral clustering. The number of random initializations
required in the update strategy in (14) is set to 10 in the experiment. To
measure the performance, two metrics have been used: Clustering Accuracy (ACC)
and Normalized Mutual Information (NMI).

For a data point $x_i$, its ground truth label is denoted as $p_i$ and its
clustering label, produced by a clustering algorithm, is denoted as $q_i$. The
ACC metric over a data set with $n$ data points is then defined as follows:

$$ACC = \frac{\sum_{i=1}^{n} \delta(p_i, map(q_i))}{n}, \tag{17}$$

where $\delta(x, y) = 1$ if $x = y$ and $\delta(x, y) = 0$ otherwise. $map(x)$
is the best mapping function, which permutes clustering labels to match the
ground truth labels using the Kuhn-Munkres algorithm. A larger ACC means better
performance.
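A small worked example may help with the role of the mapping function. The sketch below uses a brute-force search over label permutations instead of the Kuhn-Munkres algorithm, which gives the same result but is only practical for a handful of clusters. The function name `clustering_accuracy` and the toy labels are illustrative.

```python
from itertools import permutations

def clustering_accuracy(truth, pred):
    """ACC under the best one-to-one relabeling of the cluster labels.

    Brute force over permutations: a stand-in for Kuhn-Munkres that is
    only practical for a small number of clusters.
    """
    true_ids = sorted(set(truth))
    pred_ids = sorted(set(pred))
    # Pad with None so every predicted cluster can be mapped to something.
    size = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (size - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for p, q in zip(truth, pred) if mapping[q] == p)
        best = max(best, hits)
    return best / len(truth)

truth = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [2, 2, 1, 1, 1, 0, 0, 0]   # same grouping up to relabeling, one mistake

print(clustering_accuracy(truth, pred))   # 7 of 8 points match: 0.875
```

The best mapping here sends cluster 2 to class 0, cluster 1 to class 1 and cluster 0 to class 2, which leaves a single mismatched point.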
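According to the definition in [24], NMI is defined as:

$$NMI = \frac{\sum_{l=1}^{c} \sum_{h=1}^{c} t_{l,h} \log\left( \frac{n\, t_{l,h}}{t_l\, \tilde{t}_h} \right)}{\sqrt{\left( \sum_{l=1}^{c} t_l \log \frac{t_l}{n} \right) \left( \sum_{h=1}^{c} \tilde{t}_h \log \frac{\tilde{t}_h}{n} \right)}}, \tag{18}$$

where $t_l$ is the number of data points in the $l$-th cluster,
$1 \le l \le c$, generated by a clustering algorithm, $\tilde{t}_h$ denotes the
number of data points in the $h$-th ground truth cluster, and $t_{l,h}$ is the
number of data points in the intersection of the $l$-th and $h$-th clusters.
Similarly, a larger NMI means better performance.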
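The NMI computation can be sketched with plain counting. The code below assumes the common normalization by the geometric mean of the two cluster entropies; the function name `nmi` and the toy labelings are illustrative, and pairs with an empty intersection are skipped because they contribute nothing to the sum.

```python
import math
from collections import Counter

def nmi(truth, pred):
    """Normalized mutual information between a clustering and ground truth."""
    n = len(truth)
    t_pred = Counter(pred)                 # sizes of the generated clusters
    t_true = Counter(truth)                # sizes of the ground truth clusters
    t_joint = Counter(zip(pred, truth))    # sizes of the intersections
    mutual = sum(t * math.log(n * t / (t_pred[l] * t_true[h]))
                 for (l, h), t in t_joint.items())
    h_pred = sum(t * math.log(t / n) for t in t_pred.values())
    h_true = sum(t * math.log(t / n) for t in t_true.values())
    denom = math.sqrt(h_pred * h_true)
    return mutual / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # identical grouping up to relabeling
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # grouping independent of the truth
```

Unlike ACC, NMI needs no label mapping, since it depends only on the sizes of the clusters and of their intersections.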

The performance evaluations are performed over six benchmark datasets as
follows:

— COIL20 [?]: It contains 1,440 gray-scale images of 20 objects (72 images
per object) under various poses. The objects are rotated through 360 degrees
and images are taken at intervals of 5 degrees.

— MNIST [?]: It is a large-scale dataset of handwritten digits, which has
been widely used as a test bed in data mining. The dataset contains 60,000
training images and 10,000 testing images. In this paper, we use its subset
version, MNIST-S, in which one handwritten digit image per ten images, for
each class, is randomly sampled from the MNIST database. There are 6,996
handwritten images with a resolution of 28 x 28.

— ORL [22]: This data set, which is used as a benchmark for face recognition,
consists of 40 different subjects with 10 images each. We resize each image to
32 x 32.

— UMIST: UMIST, which is also known as the Sheffield Face Database, consists
of 564 images of 20 individuals. Each individual is shown in a variety of
poses from profile to frontal views.

— USPS [12]: This dataset collects 9,298 images of handwritten digits (0-9)
from envelopes by the U.S. Postal Service. All images have been normalized to
the same size of 16 x 16 pixels in gray scale.

— YaleB [8]: It consists of 2,414 frontal face images of 38 subjects.
Different lighting conditions have been considered in this dataset. All images
are reshaped into 32 x 32 pixels.

The pixel values are used as the features. Details of the data sets used in
this paper are summarized in Table 1.

Table 1. Summary of data sets.

                      COIL20   MNIST   ORL     UMIST   USPS    YaleB
Number of data        1,440    6,996   400     564     9,298   2,414
Number of classes     20       10      40      20      10      38
Feature dimensions    1,024    784     1,024   644     256     1,024


5.2 Experimental Results 

To compare the performance of our proposed algorithm with the others, we repeat
the test five times and report the average results (ACC and NMI) with standard
deviations in Tables 2 and 3. Our proposed method consistently achieves better
performance than all the other compared approaches across all the data sets. It
is also worth noting that our method is superior to the state-of-the-art
counterparts that rely on a graph Laplacian (SPEC, RUFS, NDFS, LapScore).

Table 2. Performance comparison (ACC ± STD).

           COIL20            MNIST             ORL               UMIST             USPS              YaleB
AllFea     0.7051 ± 0.0294   0.6009 ± 0.0063   0.6675 ± 0.0112   0.4800 ± 0.0115   0.7139 ± 0.0272   0.1261 ± 0.0025
MaxVar     0.7124 ± 0.0191   0.6239 ± 0.0100   0.6965 ± 0.0121   0.4984 ± 0.0141   0.7165 ± 0.0186   0.1291 ± 0.0042
SPEC       0.7105 ± 0.0116   0.6254 ± 0.0024   0.6645 ± 0.0065   0.4824 ± 0.0077   0.7037 ± 0.0315   0.1307 ± 0.0049
MCFS       0.7355 ± 0.0050   0.6299 ± 0.0037   0.7055 ± 0.0048   0.5239 ± 0.0038   0.7634 ± 0.0138   0.1355 ± 0.0043
RUFS       0.7365 ± 0.0024   0.6294 ± 0.0028   0.6920 ± 0.0033   0.5110 ± 0.0091   0.7659 ± 0.0076   0.1795 ± 0.0032
NDFS       0.7368 ± 0.0074   0.6291 ± 0.0016   0.7050 ± 0.0031   0.5243 ± 0.0028   0.7630 ± 0.0124   0.1315 ± 0.0034
LapScore   0.7126 ± 0.0249   0.6214 ± 0.0054   0.7100 ± 0.0117   0.5092 ± 0.0062   0.7089 ± 0.0324   0.1255 ± 0.0025
Ours       0.7475 ± 0.0076   0.6392 ± 0.0056   0.7210 ± 0.0052   0.5343 ± 0.0062   0.7813 ± 0.007    0.1886 ± 0.0043

Table 3. Performance comparison (NMI ± STD).

           COIL20            MNIST             ORL               UMIST             USPS              YaleB
AllFea     0.7884 ± 0.0157   0.5162 ± 0.0027   0.8265 ± 0.0129   0.6715 ± 0.0069   0.6305 ± 0.0029   0.1968 ± 0.0017
MaxVar     0.7932 ± 0.0071   0.5314 ± 0.0063   0.8424 ± 0.0085   0.6825 ± 0.0063   0.6361 ± 0.0021   0.2123 ± 0.0040
SPEC       0.7866 ± 0.0061   0.5367 ± 0.0035   0.8232 ± 0.0021   0.6753 ± 0.0114   0.6215 ± 0.0073   0.2071 ± 0.0027
MCFS       0.8066 ± 0.0025   0.5367 ± 0.0003   0.8460 ± 0.0025   0.7005 ± 0.0053   0.6419 ± 0.0015   0.2024 ± 0.0033
RUFS       0.8045 ± 0.0025   0.5374 ± 0.0021   0.8430 ± 0.0044   0.6898 ± 0.0035   0.6468 ± 0.0027   0.2845 ± 0.0040
NDFS       0.8062 ± 0.0058   0.5376 ± 0.0004   0.8458 ± 0.0026   0.6981 ± 0.0054   0.6452 ± 0.0054   0.2048 ± 0.0041
LapScore   0.7920 ± 0.0101   0.5308 ± 0.0065   0.8421 ± 0.0006   0.6924 ± 0.0027   0.6291 ± 0.0047   0.1945 ± 0.0018
Ours       0.8119 ± 0.0035   0.5422 ± 0.0018   0.8518 ± 0.0027   0.7112 ± 0.0033   0.6535 ± 0.0022   0.2959 ± 0.0043

We study how the number of selected features affects the performance in an
experiment whose results are shown in Figure 1. The figure illustrates the
performance variations with respect to the number of selected features using
the proposed algorithm over three data sets, COIL20, MNIST, and USPS. We only
adopt ACC as the metric. Some observations can be made: 1) When the number of
selected features is small, e.g. 500 on each data set, the accuracy is
relatively low. 2) As the number of selected features increases, the
performance peaks at a certain point. For example, the performance of our
algorithm peaks at 0.7475 on COIL20 when the number of selected features
increases to 800. Similarly, 0.6392 (800 selected features) and 0.7813 (600
selected features) are observed on MNIST and USPS, respectively. 3) When all
features are in use, the performance is worse than the best. Similar trends can
also be observed on the other data sets. We conclude that our algorithm is able
to select distinctive features.

Fig. 1. Performance variation results with respect to the number of selected features
using the proposed algorithm over three data sets, COIL20, MNIST, and USPS.

To demonstrate that exploiting feature correlation is beneficial to the
performance, we conduct an experiment in which the parameters $\alpha$ and $p$
are both fixed at 1, while $\beta$ varies in a range of
$[0, 10^{-?}, 10^{-?}, 10^{-?}, 1, 10^{?}, 10^{?}, 10^{?}]$. The performance
variation results with respect to different values of $\beta$ are plotted in
Figure 2. The experiment is conducted over three data sets, i.e. COIL20, MNIST,
and USPS. From the results, we observe that the performance is relatively low
when no correlation is exploited in the framework, i.e. $\beta = 0$. The
performance always peaks at a certain point when a proper degree of sparsity is
imposed on the regularization term. For example, the performance is only 0.6993
when $\beta = 0$ on COIL20 and increases to 0.7285 when $\beta = 10^{?}$.
Similar observations are also obtained on the other data sets. We can conclude
that sparse structure learning on the feature coefficient matrix contributes to
the performance of our unsupervised feature selection method.



























Fig. 2. Performance variation results with respect to different values of the
regularization parameter $\beta$ over three data sets, COIL20, MNIST, and USPS.

Fig. 3. Performance variation results under different combinations of $\alpha$
and $p$. $\beta$ is fixed at $10^{-?}$.


5.3 Studies on Parameter Sensitivity and Convergence 

There are three parameters in our algorithm, denoted as $\alpha$, $\beta$ and
$p$ in (6). $\alpha$ and $\beta$ are two regularization parameters, while $p$
controls the degree of sparsity. To investigate the sensitivity of the
parameters, we conduct an experiment to study how they influence performance.
Firstly, we fix $\beta = 10^{-?}$ and report the performance variations under
different combinations of $\alpha$ and $p$ in Figure 3. Secondly, $\alpha$ is
fixed at $10^{-?}$, and the performance variation results with respect to
different values of $\beta$ and $p$ are shown in Figure 4. Both $\alpha$ and
$\beta$ vary in a range of $[10^{-?}, 10^{-?}, 10^{?}, 10^{?}]$, while $p$
changes in $[0.5, 1.0, 1.5]$. We only take ACC as the metric.

To validate that our algorithm monotonically increases the objective function
value in (6), we conduct a further experiment in which all parameters
($\alpha$, $\beta$, and $p$) in (6) are fixed at 1. The objective function
values and the corresponding iteration numbers are drawn in Figure 5. We take
COIL20, MNIST, and USPS as examples; similar observations can also be obtained
on the other data sets. From the figure, it can be seen that our algorithm
converges, usually within eight iteration steps, on the three data sets. We can
then conclude that the proposed method is efficient and effective.

Fig. 4. Performance variation results under different combinations of $\beta$
and $p$. $\alpha$ is fixed at $10^{-?}$.




Fig. 5. Objective function values of our proposed objective function in (6) over three
data sets, COIL20, MNIST, and USPS.























6 Conclusion 


In this paper, we have proposed an unsupervised feature selection approach
based on the Maximum Margin Criterion and a sparsity-based model. More
specifically, the proposed method maximizes the total scatter while
simultaneously minimizing the within-class scatter. Since there is no label
information in an unsupervised scenario, K-means clustering is jointly embedded
into the framework. The advantages are twofold. First, the pseudo labels
generated by K-means clustering help maximize class margins in each iteration
step. Second, the pseudo labels guide the sparsity-based model to exploit the
sparse structure of the feature coefficient matrix, so that noisy and
uncorrelated features can be removed. Since the objective function is not
jointly convex in all variables, we have proposed an algorithm with a
guaranteed convergence property. To avoid rapid convergence to a local optimum
caused by K-means, we have applied an updating strategy that alleviates the
problem, so that the method may converge to the global optimum. Extensive
experimental results show that our method outperforms all the compared
approaches on six benchmark data sets.